Moving to Production: The Deployment Mindset
EvoClass-AI002 Lecture 10

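This module is the final step: it bridges the gap between research that achieves high accuracy in a notebook and reliable execution in production. Deployment is the critical process of turning a PyTorch model into a minimal, self-contained service that returns predictions to end users efficiently, with low latency and high availability.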

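1. The Shift to a Production Mindset

The exploratory environment of a Jupyter Notebook is stateful and too fragile for production. We must refactor code from exploratory scripts into structured, modular components suited to concurrent request handling, resource optimization, and seamless integration into larger systems.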

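Low-latency inference: Keep prediction time consistently below a target threshold (for example $50\text{ms}$), which is essential for real-time applications.
High availability: Design the service to be reliable and stateless, and able to recover quickly after a failure.
Reproducibility: Ensure that the deployed model and its environment (dependencies, weights, configuration) exactly match the validated research results.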
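One way to make the latency target concrete is to time repeated calls and compare a high percentile, rather than the average, against the budget. The following is a minimal sketch using only the Python standard library. The helper names (`measure_latencies_ms`, `p95`, `within_budget`) and the choice of the 95th percentile are assumptions for illustration and are not part of the lecture material.

```python
import statistics
import time
from typing import Any, Callable, List


def measure_latencies_ms(predict: Callable[[Any], Any], payload: Any, runs: int = 200) -> List[float]:
    """Call `predict` repeatedly and record each call's wall-clock time in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def p95(latencies_ms: List[float]) -> float:
    """95th percentile: with n=20, quantiles() returns 19 cut points and the last is p95."""
    return statistics.quantiles(latencies_ms, n=20)[-1]


def within_budget(latencies_ms: List[float], budget_ms: float = 50.0) -> bool:
    """True when the 95th-percentile latency stays at or below the budget."""
    return p95(latencies_ms) <= budget_ms
```

A tail percentile is used here because a service can have a comfortable mean latency while a noticeable share of requests still exceeds the threshold.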
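Focus: Model Serving
We should not deploy the full training script. Instead, we deploy a minimal, self-contained service wrapper. The service has only three jobs: load the optimized model file, apply input preprocessing, and run the forward pass to return a prediction.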
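The three jobs of the service wrapper can be sketched as a small class. This is an illustrative outline under stated assumptions, not the course's reference implementation: `InferenceService`, `load_stand_in_model`, and the pixel-scaling preprocessing are hypothetical, and the stand-in model is plain Python so the example runs without PyTorch. In a real PyTorch service the loader would typically wrap something like `torch.jit.load(...)` on the exported model file.

```python
from typing import Callable, Dict, List, Sequence

Model = Callable[[List[float]], List[float]]

LOAD_CALLS = 0  # counts how often the (expensive) load step runs


def load_stand_in_model() -> Model:
    """Hypothetical loader; stands in for reading an optimized model file from disk."""
    global LOAD_CALLS
    LOAD_CALLS += 1

    def model(features: List[float]) -> List[float]:
        # Toy two-class "forward pass": scores derived from the mean feature value.
        mean = sum(features) / len(features)
        return [1.0 - mean, mean]

    return model


class InferenceService:
    """Minimal wrapper: load once, preprocess, run the forward pass."""

    def __init__(self, load_model: Callable[[], Model]) -> None:
        # Job 1: load the model exactly once, when the service initializes.
        self._model = load_model()

    @staticmethod
    def preprocess(raw: Sequence[float]) -> List[float]:
        # Job 2: assumed preprocessing, scaling 0-255 pixel values into [0, 1].
        return [value / 255.0 for value in raw]

    def predict(self, raw: Sequence[float]) -> Dict[str, object]:
        # Job 3: forward pass, then return the highest-scoring class index.
        scores = self._model(self.preprocess(raw))
        label = max(range(len(scores)), key=scores.__getitem__)
        return {"label": label, "scores": scores}


service = InferenceService(load_stand_in_model)
```

Because `predict` keeps no per-request state and the model is loaded in `__init__`, every request reuses the same in-memory model, which is what keeps per-request latency low.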
Question 1
Which feature of a Jupyter notebook makes it unsuitable for production deployment?
It primarily uses Python code
It is inherently stateful and resource-intensive
It cannot directly access the GPU
Question 2
What is the primary purpose of converting a PyTorch model to TorchScript or ONNX before deployment?
Optimization for faster C++ execution and reduced Python dependency
To prevent model theft or reverse engineering
To automatically handle input data preprocessing
Question 3
When designing a production API, when should the model weights be loaded?
Once, when the service initializes
At the start of every prediction request
When the first request to the service is received
Challenge: Defining the Minimal Service
Plan the structural requirements for a low-latency service.
You need to deploy a complex image classification model ($1\text{GB}$) that requires specialized image preprocessing. It must handle $50$ requests per second.
Step 1
To ensure high throughput and low average latency, what is the single most critical structural change needed for the Python script?
Solution:
Refactor the codebase into isolated modules (Preprocessing, Model Definition, Inference Runner) and ensure the entire process is packaged for containerization.
Step 2
What is the minimum necessary "artifact" to ship, besides the trained weights?
Solution:
The exact code/class definition used for preprocessing and the model architecture definition, serialized and coupled with the weights.
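One way to keep the shipped artifacts coupled, sketched here as an assumption rather than a prescribed method, is to record a checksum manifest for every file in the bundle (weights, preprocessing configuration, model definition) and verify it when the service starts. The function names and the manifest layout below are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files: Iterable[Path]) -> Dict[str, str]:
    """Map each artifact's file name to its SHA-256 digest."""
    return {path.name: sha256_of(path) for path in files}


def verify_manifest(directory: Path, manifest: Dict[str, str]) -> bool:
    """True only if every listed artifact exists in `directory` and matches its digest."""
    return all(
        (directory / name).is_file() and sha256_of(directory / name) == digest
        for name, digest in manifest.items()
    )
```

Running `verify_manifest` at startup lets the service refuse to serve if the weights and the preprocessing code it was validated with have drifted apart.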